DataFrame Null Values

A NULL value can be identified in multiple ways. Many people confuse it with a BLANK or empty string, but there is a difference: NULL means the value is unknown, while BLANK is a known, empty value. Now let's look at the operations available on a Spark DataFrame that help in handling NULL values.
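For example, the difference shows up when filtering (a minimal sketch; the two-row demo DataFrame is made up for illustration and assumes a spark-shell session where spark.implicits._ is already in scope):

val demo = Seq(("row1", null), ("row2", "")).toDF("id", "value")
demo.filter($"value".isNull).show()   // matches only row1 (the unknown value)
demo.filter($"value" === "").show()   // matches only row2 (the known, empty value)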
 
Create DataFrame 
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = new StructType()
    .add("empno", IntegerType)
    .add("ename", StringType)
    .add("job", StringType)
    .add("mgr", IntegerType)
    .add("hiredate",StringType)
    .add("sal", IntegerType)
    .add("comm", IntegerType)
    .add("deptno",IntegerType)

val data = Seq(
      Row(7369, "SMITH", "CLERK", 7902, "17-Dec-80", 800, 20, 10),
      Row(7499, "ALLEN", "SALESMAN", 7698, "20-Feb-81", 1600, 300, 30),
      Row(7521, "WARD", "SALESMAN", 7698, "22-Feb-81", 1250, 500, null),
      Row(7566, "JONES", "MANAGER", null, "2-Apr-81", 2975, 0, 20),
      Row(null,null,null,null,null,null,null,null),
      Row(7654, "MARTIN", "SALESMAN", 7698, "28-Sep-81", 1250, 1400, null),
      Row(7698, "BLAKE", "MANAGER", 7839, "1-May-81", 2850, 0, 30),
      Row(7782, "CLARK", "MANAGER", 7839, "9-Jun-81", 2450, 0, 10),
      Row(7788, "SCOTT", "ANALYST", 7566, "19-Apr-87", 3000, 0, 20),
      Row(7839, "KING", "PRESIDENT", 0, "17-Nov-81", 5000, 0, 10),
      Row(7844, "TURNER", "SALESMAN", 7698, "8-Sep-81", 1500, 0, 20),
      Row(7876, "ADAMS", "CLERK", 7788, "23-May-87", 1100, 0, 20))

val empDF = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
empDF.show

Find the records for which Department No (deptno) is NULL.
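One way to find them is with isNull on the deptno column (a sketch, assuming the $ column syntax that spark.implicits._ provides automatically in spark-shell):

empDF.filter($"deptno".isNull).show()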


Drop rows that have NULL in any column.
empDF.na.drop("any").show()

Drop rows that have NULL in all columns.
empDF.na.drop("all").show()

 
Drop rows that contain fewer than n non-NULL values, i.e. keep a row only if at least n of its columns are NOT NULL.
empDF.na.drop(5).show()   // keep rows with at least 5 non-NULL values

empDF.na.drop(6).show()   // keep rows with at least 6 non-NULL values


Drop rows only when all of the specified columns are NULL. The default mode is "any", so "all" must be mentioned explicitly in the drop method along with the column list.
empDF.na.drop("all",Array("comm","deptno")).show()

Drop rows that have NULL in any of the specified columns.
empDF.na.drop(Array("comm")).show()
empDF.na.drop(Array("comm","deptno")).show()
 
Fill all the "numeric" columns with a default value if NULL.
empDF.na.fill(-1).show()

Replace the value in a specific column with a default value. If the default value does not match the datatype of the column, it is ignored.
empDF.na.fill(-1, Array("deptno")).show()

Fill multiple columns, each with its own default value, by passing a Map of column name to value.
empDF.na.fill(Map("deptno" -> -1, "hiredate" -> "2999-12-31")).show()

Fill all the "string" columns with a default value if NULL.
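A sketch of that last case, assuming "UNKNOWN" as the placeholder value (passing a String to na.fill applies it only to the string columns that contain NULL):

empDF.na.fill("UNKNOWN").show()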



Filter a Spark DataFrame to exclude both NULLs and blank values
col1   col2
1      abc
2      null
3      null
4  
5      def

val myDF = Seq((1, "abc"), (2, null), (3, null), (4, ""), (5, "def")).toDF("col1", "col2")
// Keep only rows where col2 is neither an empty string nor NULL
myDF.filter(($"col2" =!= "") && ($"col2".isNotNull)).show()
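The filter above removes NULLs and empty strings. If whitespace-only values should also be excluded, trim can be applied first (a sketch using the built-in trim function from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.trim
myDF.filter($"col2".isNotNull && (trim($"col2") =!= "")).show()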
